ABSTRACT 

The catalytic site identification web server provides 
the innovative capability to find structural matches 
to a user-specified catalytic site among all Protein 
Data Bank proteins rapidly (in less than a minute). 
The server also can examine a user-specified 
protein structure or model to identify structural 
matches to a library of catalytic sites. Finally, the 
server provides a database of pre-calculated 
matches between all Protein Data Bank proteins 
and the library of catalytic sites. The database has 
been used to derive a set of hypothesized novel 
enzymatic function annotations. In all cases, matches 
and putative binding sites (protein structure and 
surfaces) can be visualized interactively online. 
The website can be accessed at http://catsid.llnl.gov. 

INTRODUCTION 

Not surprisingly, in the post-genomic era, the focus on 
gene products and their functions has been generally 
determined by gene sequences and their sequence similar- 
ity to previously annotated sequences (1-10). However, 
many existing annotations, especially those derived 
solely through sequence similarity, are misleading or 
incorrect (11). To better understand gene products, struc- 
tural genomic efforts have provided structures of thou- 
sands of proteins. Typically, for structurally similar 
proteins, functions are determined through template 
matching because proteins that adopt the same fold 
frequently exhibit the same function. A number of 
approaches use structural similarity (12-26) to determine 
protein function. However, sequence and structural 
homology are not sufficient to determine function of 
those proteins that have different global folds overall 
but similar functions. Thus, many proteins still do not 
have determined functions. 

Several computational approaches that determine local 
structural similarity are able to capture functional 
similarity that is missed by global structural similarity 
algorithms (27-32). In general, these approaches emphasize 
the development of the local similarity-matching 
approach, rather than applying it to determining 
function. For the subset of proteins that catalyze reac- 
tions, the function of the protein can be determined by 
evaluating its enzymatic reaction, specifically determining 
the catalytic residues that perform the chemistry in the 
catalytic binding site. Enzymatic function is known to be 
shared among proteins having widely divergent sequences 
because key structural similarities are preserved (33). 
Torrance et al. (34) proposed that the spatial relationships 
of key 'critical residues' could be a method for assigning 
catalytic functions among disparate proteins. This hypoth- 
esis led to the development of the Catalytic Site Atlas (35), 
which is a compendium of catalytic sites and residues 
defining those sites, along with associated Enzyme 
Commission (EC) numbers. Many groups have conse- 
quently proposed methods that more specifically search 
for catalytic motifs as a way of determining function 
(36-40), including methods that explicitly incorporate 
the ligand (41). 

Here, we report the catalytic site identification web 
server, which provides users protein annotations based 
on structural catalytic residues matched to known 
proteins with specified EC numbers. A feature of the cata- 
lytic site identification server is that it offers excellent per- 
formance in matching identified protein families through 
an EC number with as few as three amino acid residues. 

Two main challenges need to be overcome to identify 
catalytic function: (i) solved structures typically are not 
available for the protein of interest and (ii) finding 
relevant structural matches with identified function may 
be computationally expensive. Homology modeling can 
address the first issue by building a 3D model of the 
protein from a known sequence and a homologous 
protein; see, e.g. (42). The catalytic site identification 
web server provides a way to address the second issue 
by scanning for structural matches in a library of catalytic 
sites derived from protein families whose members share 
catalytic function (35). The catalytic site identification web 
server supplements these catalytic site data with 
information on residue variation among catalytic site 
family members and also includes enzymatic site identifi- 
cations from other sources [e.g. (43)]. 

Through a web interface, the catalytic site identification 
web server allows users to enter their own catalytic sites, 
identifies and scores potential protein matches to their 
catalytic sites, and allows visual inspection. Because the 
algorithm has been developed to generalize a catalytic site 
as any binding site that the user chooses to enter, the 
catalytic site identification web server also has the capabil- 
ity to rapidly scan the universe of known protein struc- 
tures in Protein Data Bank (PDB) (44) for matches to any 
binding site. For example, after a user enters a site 
that is targeted by a specific drug candidate, the server 
can quickly produce a list of proteins that may have 
similar binding sites (anywhere on the protein). The drug 
candidate could bind to these off-target proteins, 
potentially causing side effects (45). Additionally, 
allosteric binding sites on proteins, based on known 
binding sites, could also be identified (46). Thus, the cata- 
lytic site identification web server could be used for 
general binding site identification, depending on the 
user's questions. These applications are the focus of 
further investigations. 

MATERIALS AND METHODS 

Search procedure 

The catalytic site identification web server uses a highly 
efficient graph-based method to identify candidate 
matches to catalytic sites. The procedure treats residues 
as nodes of a graph, with the distances between pairs of 
residues as its edges. To compute the distances, one 
atomic coordinate per residue (usually the Cα atom 
of an amino acid) is chosen. A library of catalytic sites is 
pre-computed, and the catalytic site pattern is compared 
with possible patterns within the larger protein graphs. 
The procedure allows for residue substitutions in catalytic 
site identity as well. Catalytic sites are defined by the 
relative spatial coordinates of three or more 'critical resi- 
dues'. At least three residues are needed for the web server 
implementation; sites defined by fewer residues do not 
provide sufficient information for specificity in the 
search procedure. A complete description of the search 
algorithm is presented in a related work (47). 
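
The distance-based matching idea can be illustrated with a brute-force sketch. This is not the server's graph search (which is described in (47)); it simply enumerates residue combinations and keeps those whose pairwise distances agree with the template within a tolerance. The function name, the tolerance value and the toy coordinates are assumptions made for illustration only.

```python
import itertools
import math

def find_site_matches(template, protein, tol=0.5):
    """Brute-force search for residue sets whose pairwise distances
    match a catalytic site template within `tol` (in angstroms).

    template: list of (allowed_residue_types, (x, y, z)), one per critical residue
    protein:  list of (residue_type, residue_number, (x, y, z))
    Returns a list of tuples of residue numbers, one tuple per match.
    """
    n = len(template)
    if n < 3:
        raise ValueError("at least three critical residues are required")
    matches = []
    for combo in itertools.permutations(protein, n):
        # Residue identity check, allowing the listed substitutions.
        if any(res[0] not in allowed
               for (allowed, _), res in zip(template, combo)):
            continue
        # Every template edge must be reproduced by the candidate residues.
        consistent = all(
            abs(math.dist(template[i][1], template[j][1])
                - math.dist(combo[i][2], combo[j][2])) <= tol
            for i, j in itertools.combinations(range(n), 2)
        )
        if consistent:
            matches.append(tuple(res[1] for res in combo))
    return matches

# Toy example: a three-residue site in which Asp may be substituted by Glu.
toy_template = [({"SER"}, (0.0, 0.0, 0.0)),
                ({"HIS"}, (3.0, 0.0, 0.0)),
                ({"ASP", "GLU"}, (0.0, 4.0, 0.0))]
toy_protein = [("GLY", 1, (10.0, 10.0, 10.0)),
               ("SER", 10, (5.0, 5.0, 5.0)),
               ("HIS", 57, (8.0, 5.0, 5.0)),
               ("GLU", 102, (5.0, 9.0, 5.0))]
print(find_site_matches(toy_template, toy_protein))
```

Because only inter-residue distances are compared, the match is independent of how the protein is positioned or rotated; an exhaustive enumeration like this one scales poorly, which is why an efficient graph search is needed for full-PDB scans.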

The web server incorporates several enhancements over 
the original design, which was intended to search relatively 
small sets of unknown targets. The present web implementation 
includes elements that can accommodate searches 
through the full PDB. Several optimizations have been 
developed, including pre-calculation and efficient storage 
of data, multithreading of the search procedure and an 
improved logistic regression classifier, whose descriptors 
are more rapidly computed such that the overall process 
to perform a catalytic site search against all PDB proteins 
can be completed in less than a minute. 

The search procedure has two stages, as shown in 
Figure 1. The first stage uses the rapid graph search 
procedure. From the graph procedure, the top 20 
site-protein pairs are selected. The initial regression procedure 
incorporates the same distance matrix data that are used 
for the graph search. The output of the first stage is a 
refined subset of all hits that are sent to the second stage 
descriptor calculations, which require coordinate align- 
ments to compute new descriptors. The output of the 
second stage regression is the list of candidate matches. 
The regression procedures are described in the next 
section. 

Logistic regression classifiers 

An important feature of the search procedure is the use of 
a classification procedure that allows for more systematic 
identification of true positives based on a set of physical 
descriptors. These descriptors provide information beyond 
that which is provided by the initial graph matching pro- 
cedure and enhance the quality of the prediction consid- 
erably. We briefly discuss the specifics of our regression 
procedure here. 

The logistic regression function is given as 

$$f(z) = \left(1 + e^{-z}\right)^{-1} \qquad (1)$$

where the variable $z$ is a linear function of a set of 
descriptors $x_i$, 

$$z = \beta_0 + \sum_i \beta_i(n_T)\, x_i \qquad (2)$$

and the coefficients $\beta_i$ are allowed to depend on the 
template size $n_T$ in the model used. 

The logistic function is constructed such that a larger 
value indicates a higher likelihood that the sample is a 
positive case (a match to the reference binding site), 
whereas a small value indicates a negative case. As noted 
above, the 
catalytic site identification web server uses two logistic 
classifiers. Table 1 shows the descriptors used in each clas- 
sifier, along with the coefficient values that are used. The 
descriptors are defined in detail in the Supplementary 
Methods. The benchmarks and testing section describes 
the fitting procedure. 
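
As a minimal sketch of how equations (1) and (2) combine, the following code evaluates the second-stage score using the second-stage coefficients listed in Table 1. The descriptor key names are illustrative labels rather than identifiers used by the server, and the descriptor values are assumed to have been computed and transformed already, as defined in the Supplementary Methods.

```python
import math

# Second-stage coefficients from Table 1, stored as
# (three-residue templates, four-or-more-residue templates).
# The key names are hypothetical labels chosen for this sketch.
INTERCEPT = -5.62
COEFFICIENTS = {
    "fraction_placed_fixed": (1.62, -1.33),
    "pair_distance_diff": (-0.25, -0.82),
    "normalized_pair_distance_diff": (-0.24, 0.20),
    "backbone_position": (0.96, 1.14),
    "backbone_orientation": (0.45, 0.80),
}

def logistic(z):
    """Equation (1): map the linear predictor z to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def second_stage_score(descriptors, n_residues):
    """Equation (2) followed by equation (1).

    descriptors: dict mapping descriptor label -> descriptor value
    n_residues:  number of critical residues in the template (n_T)
    """
    if n_residues < 3:
        raise ValueError("templates need at least three critical residues")
    column = 0 if n_residues == 3 else 1  # coefficient set depends on n_T
    z = INTERCEPT
    for name, value in descriptors.items():
        z += COEFFICIENTS[name][column] * value
    return logistic(z)
```

With all descriptors at zero the score reduces to the logistic function of the intercept, so a candidate needs favourable descriptor values to reach a high score.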

The subset of candidate matches from the first-stage 
ranking that proceed to the second-stage calculations is 
determined as follows. Catalytic site-protein pairs that 
score >0.06 are accepted, with the proviso that no more 
than 100 such pairs per protein are accepted; a pair 
whose score is >0.35 is always accepted, regardless of 
this cap. The proviso applies only when 
library catalytic sites are being matched to user-specified 
proteins. 
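
The selection rule can be sketched as follows. The thresholds (0.06, 0.35 and 100) come from the text; the function name, the tuple layout and the choice to keep the highest-scoring pairs when the cap applies are assumptions, since the text does not state which pairs are retained under the cap.

```python
from collections import defaultdict

def select_for_second_stage(pairs, min_score=0.06, always_keep=0.35,
                            max_per_protein=100, apply_cap=True):
    """Choose first-stage hits that proceed to the second stage.

    pairs: iterable of (site_id, protein_id, score)
    apply_cap: set to False when the per-protein proviso does not apply
    """
    by_protein = defaultdict(list)
    for site_id, protein_id, score in pairs:
        if score > min_score:
            by_protein[protein_id].append((site_id, protein_id, score))

    selected = []
    for hits in by_protein.values():
        # Assumption: under the cap, the highest-scoring pairs are kept.
        hits.sort(key=lambda hit: hit[2], reverse=True)
        for rank, hit in enumerate(hits):
            if (not apply_cap) or rank < max_per_protein or hit[2] > always_keep:
                selected.append(hit)
    return selected
```

For example, calling the function with `max_per_protein=2` on six pairs for one protein scoring 0.50, 0.40, 0.36, 0.10, 0.07 and 0.05 keeps the first three: two within the cap and one because it exceeds 0.35.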

The second-stage descriptors are included in the logistic 
regression twice: once in application to catalytic sites 
having three residues and once in application to catalytic 
sites having four or more residues. This allows training 
different coefficients for three-residue catalytic sites and 
for four-or-more-residue catalytic sites. 

Web server implementation 

The graph search (described earlier in the text) is coded in 
C++, and the auxiliary scripts for input and output 
processing are coded in Python and Perl. Typically, the 
operating system holds in cache the pre-calculated 
distance matrices for both the library of catalytic site 
templates and PDB proteins. This avoids the latency of 
accessing data stored on disk, without the need for any 
special software or hardware. 

The catalytic site identification compute server (which, 
for institutional reasons, is a different machine than the 
web server host) is a small cluster of eight compute nodes, 
as well as a master node and a database server node. Each 
compute node has 24 cores. Each catalytic site identifica- 
tion search job (i.e. a single web-server request) currently 
runs on a single node, dividing the search among the 24 
cores with parallel calculations dynamically scheduled 
with OpenMP multithreading (48). As a result, a catalytic 
site-protein search comparison averages ~6 ms on the 
compute server. With the search comparisons divided 
among 24 cores, the elapsed time to search all PDB 
proteins with one catalytic site template, including the 
second-stage backbone alignments and root-mean-square 
distance calculations, is ~40 s. 

RESULTS 

Catalytic site identification web server 

The catalytic site identification web server implements 
three main functions: (i) search PDB for matches to a 
user-specified catalytic site or sites, (ii) search the web 
server's library of catalytic site templates for matches to 
a user-specified protein structure or structures (perhaps 
derived by homology modeling) or (iii) browse a 
database of pre-calculated matches between all PDB 
proteins and the library of catalytic sites, including a set 
of hypothesized novel enzymatic function annotations. In 
all cases, matches and putative binding sites (protein 
structure and surfaces) can be visualized interactively online. 

Table 1. Logistic regression classifiers' coefficient estimates and standard errors 

Descriptor                                   First-stage classifier          Second-stage classifier
                                             Coefficient (standard error)    n_T    Coefficient (standard error)

Intercept                                    -12.19 (1.31) a                 all    -5.62 (1.20) a
1. Fraction residues correctly placed        -6.27 (1.35) a                  3      1.62 (0.55) b
   (fixed distance threshold of 0.5 Å)                                       4+     -1.33 (1.42)
2. Fraction residues correctly placed        5.39 (2.36) c
   (relative distance threshold of 10%)
3. Residue-pair distance difference          0.83 (0.44) d                   3      -0.25 (0.13) d
                                                                             4+     -0.82 (0.23) a
4. Normalized residue-pair distance          1.40 (0.36) a                   3      -0.24 (0.21)
   difference                                                                4+     0.20 (0.26)
5. Position of backbone atoms                                                3      0.96 (0.17) a
                                                                             4+     1.14 (0.29) a
6. Orientation of backbone atoms                                             3      0.45 (0.12) a
                                                                             4+     0.80 (0.16) a

a Significant at 0.1% level. 
b Significant at 1% level. 
c Significant at 5% level. 
d Significant at 10% level. 

The first-stage and second-stage classifiers use different subsets of descriptors; the second-stage classifier distinguishes 
between coefficients applied to three-residue catalytic sites and sites having four or more residues (n_T). The 
distance-difference descriptors enter the estimation as the transformed variable, d_r = 1/(0.1 + d), so that smaller distance 
differences are 'better' (i.e. are expected to have a positive coefficient) while avoiding singularities. 

Figure 1. Process workflow for catalytic site identification. The logistic 
scoring and the descriptors are described in Table 1 and in 
Supplementary Methods. 

Input 

Search PDB for matches to catalytic sites 
Users specify a catalytic site by indicating the critical 
residues that comprise the site (which may include ions 
and cofactors). The site may be in an existing PDB 
protein, from which coordinate data will be extracted or 
in a PDB-formatted coordinate file uploaded by the user. 
Residue-type substitutions may be specified. For example, 
a site that includes a glutamic acid (Glu) may be specified 
so that a protein with an aspartic acid (Asp) at that 
location can be a match candidate. Multiple catalytic 
sites can be defined in a single search request. Uploaded 
catalytic site coordinate data files can be saved on the 
server for future use. 

Search the server's library of catalytic sites for matches 
to proteins 

Users upload a PDB-formatted protein coordinate file or 
files that can then be used to search the library of catalytic 
sites for matches. Uploaded protein files can be saved on 
the server for future use. 

Browse database of matches between PDB proteins and 
library catalytic sites 

The user inputs a PDB code, an EC number or a partial 
EC number to search PDB for proteins that match the 
catalytic site library. 

Output 

In all three cases, the output is a list of matches between 
proteins and catalytic sites, ordered with the highest- 
scoring match first. The results are presented in a table 
format as shown in Figure 2, which is the result of 
browsing for catalytic sites matching PDB protein 1deo. 
Starting from the left, the first column, 'View', provides a 
button to open the visualizer, as described later in the text. 
The second column, 'Score', shows the resulting match 
score as discussed under 'Benchmarks and testing'. 
Evaluation of the scoring performance indicates that 
matches with scores >0.02 are a good indication of a 
'positive', i.e. a likely correct assignment of catalytic 
function. The third column, 'Catalytic site', identifies the 
matching structures. The final characters of the catalytic 
site identifier (after the last hyphen) indicate a particular 
binding site for multimeric proteins, which may have their 
catalytic function on multiple chains; inclusion of the 
alternate binding sites in the web server's library allows 
for structural variation. The fourth and fifth columns, 
'Catalytic site EC number' and 'Catalytic site EC label', 
indicate the catalytic function associated with the catalytic 
site template structure. The last two columns on the right, 
'Catalytic site Uniprot EC number' and 'Catalytic site 
Uniprot EC label', show the annotation of the catalytic 
site template (column 3) provided in UniProt (49). A link 
allows the data to be downloaded in comma-separated- 
values format for import to a spreadsheet. 
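
A downloaded results file could be post-processed along the following lines. This is a sketch only: the column header names in the CSV file are assumed to mirror the table headings in Figure 2, and the sample rows are copied from that figure rather than from an actual download.

```python
import csv
import io

def hits_above_threshold(csv_text, threshold=0.02, score_column="Score"):
    """Return result rows whose score exceeds `threshold`, best first.

    csv_text: contents of a comma-separated-values results file
    score_column: assumed name of the score column
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if float(row[score_column]) > threshold]
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows

# Sample rows taken from Figure 2; the header names are assumptions.
SAMPLE = """Score,Catalytic site,Catalytic site EC number
0.053,1j00-0,3.01.02.0000
0.694,1pp4-1,3.01.01.0086
0.149,1bwp-0,3.01.01.0047
0.618,1pp4-0,3.01.01.0086
0.053,1j00-1,3.01.02.0000
"""

for row in hits_above_threshold(SAMPLE, threshold=0.1):
    print(row["Score"], row["Catalytic site"])
```

Raising the threshold, as in the usage line above, is a simple way to focus on the most confident assignments before inspecting them in the visualizer.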



Visualization 

Each of the resulting matches is available for visualization 
on the web server with the Jmol viewer (50) with a click of 
the 'View' button. Figure 3 shows two of the catalytic site 
matches to protein PDB 1deo, as listed in Figure 2. These 
results are discussed further in the 'Sample Uses' section 
later in the text. Additional visualization options are avail- 
able on the server, including whether the protein surface is 
shown, whether the surface is transparent or opaque and 
whether co-crystallized ligands are shown. Further 
options are available via Jmol's menu and command line 
interface. 

Benchmarks and testing 

Training and test data sets 

Non-overlapping samples of PDB proteins were drawn to 
'train' the logistic regression classifier (i.e. estimate the 
coefficients of the logistic regression function) and to 
'test' the classifier on data that are independent of that 
used in the training. The training data sample consists of 
approximately one-tenth of PDB proteins, filtered to 
include only those proteins annotated with EC numbers. 
Candidate matches include 53 088 catalytic site-protein 
pairs. Of these pairs, 52 473 matches were with catalytic 
site templates having three critical residues, and 720 
matches were with catalytic site templates having four or 
more critical residues. Of the 53 088 candidate matches, 
503 catalytic site-protein pairs were positives, i.e. correct 
matches, for all four parts of the EC number (class, 
subclass, sub-subclass and serial number), between the 
catalytic site template EC number and the protein EC 
number. This definition of 'positive' was chosen to 
provide the most specific definition of 'success' for 
purposes of training the classifier. Of the positive results, 
258 matches were with catalytic site templates having three 
critical residues; 245 matches were with catalytic site tem- 
plates having four or more critical residues. The resulting 
trained classifier coefficients are shown in Table 1. 
Generally, to avoid overfitting, descriptors were retained 
in the regression when they contributed to a parsimonious 
specification according to the Akaike information criter- 
ion (51). 
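
The definition of a 'positive' used for training can be made concrete with a small helper that compares EC numbers field by field. The function names are hypothetical; the parser accepts both the zero-padded form shown in the server's tables (e.g. 3.01.01.0086) and the conventional form (e.g. 3.1.1.86), and treats a missing or '-' field as unspecified.

```python
def parse_ec(ec):
    """Convert an EC string into a 4-tuple of ints, with None for
    fields that are missing or given as '-'."""
    parts = ec.replace("EC", "").strip().split(".")
    fields = []
    for part in parts[:4]:
        part = part.strip()
        fields.append(int(part) if part.isdigit() else None)
    fields += [None] * (4 - len(fields))
    return tuple(fields)

def ec_match_level(ec_a, ec_b):
    """Number of leading EC fields (0 to 4) on which two EC numbers agree."""
    level = 0
    for a, b in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if a is None or b is None or a != b:
            break
        level += 1
    return level

def is_positive(site_ec, protein_ec):
    """A 'positive' requires agreement on all four parts of the EC number."""
    return ec_match_level(site_ec, protein_ec) == 4
```

Under this definition, EC 3.1.1.86 and EC 3.1.1.47 agree at three levels and would therefore be counted as a negative, even though the underlying chemistry is closely related.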

The test data set consists of a different sample of PDB 
proteins. The sample includes 27 436 catalytic site-protein 
pairs, 27 008 of these with three-residue catalytic site tem- 
plates and 428 with four-or-more-residue catalytic site 
templates. There are 342 positives in the data set, 221 of 
these with three-residue catalytic site templates and 121 
with four-or-more-residue catalytic site templates. 

Receiver Operating Characteristic curves 
Receiver Operating Characteristic (ROC) curves illustrate 
the performance of a binary classifier. The catalytic site 
identification classifier should distinguish matches 
between proteins and catalytic sites that share catalytic 
function ('positives') from matches between proteins and 
catalytic sites that do not share catalytic function ('nega- 
tives'). How the classifier makes this distinction depends 
on the score threshold used. A high threshold may 
correctly identify positives ('true positives') and exclude 
negatives ('true negatives'), but at the cost of identifying 
only a portion of all positives. A lower threshold will find 
more true positives, but at the cost of incorrectly 
identifying some negatives as positive ('false positives'). An ROC 
curve uses proteins with known catalytic function to plot 
true positives as a fraction of all positives in the data set 
('true positive rate') versus false positives as a fraction of 
all negatives ('false positive rate'), both as a function of 
the score threshold value. 

Matches between catalytic sites and protein 1deo 

View   Score   Catalytic site   Catalytic site   Catalytic site EC label                      Catalytic site      Catalytic site Uniprot EC label
                                EC number                                                     Uniprot EC number
View   0.694   1pp4-1           3.01.01.0086     Rhamnogalacturonan acetyl esterase           3.01.01.0086        Rhamnogalacturonan acetylesterase
View   0.618   1pp4-0           3.01.01.0086     Rhamnogalacturonan acetyl esterase           3.01.01.0086        Rhamnogalacturonan acetylesterase
View   0.149   1bwp-0           3.01.01.0047     Platelet-activating factor acetylhydrolase   3.01.01.0047        Platelet-activating factor acetylhydrolase
View   0.053   1j00-0           3.01.02.0000     Thioesterase i                               3.01.02.-           Acyl-CoA thioesterase 1
View   0.053   1j00-1           3.01.02.0000     Thioesterase i                               3.01.02.-           Acyl-CoA thioesterase 1

(The page also offers links to show the 65 hits below the cutoff score and to download the results in comma-separated-values format.) 

Figure 2. Sample output: browsing catalytic site matches to PDB 1deo. These are the top-scoring matches within a score threshold of 0.02. 

Figure 3. Visualization of matches between protein PDB 1deo and catalytic sites. (a) The aligned matching critical residues in protein 1deo (blue) 
and catalytic site 1pp4-1, 3.1.1.86, Rhamnogalacturonan acetylesterase (magenta). Residue names and numbers are identical between the protein 
and the catalytic site. (b) The aligned matching critical residues in protein 1deo (blue) and catalytic site 1j00-0, 3.1.2, Thioesterase I (magenta). 
The crystallographers' modification of SER10 with a simulated substrate moiety is not shown here for clarity. 

Before constructing ROC curves for the training 
and test data sets, 'duplicate results' were deleted. 
Occasionally, there are multiple correct matches 
('positives'), such as additional binding sites on a 
multimeric protein. Similarly, there may be multiple 
incorrect matches ('negatives'). To avoid overstating 
either the true positive rate or the false positive rate in 
constructing the ROC curves, only the highest-scoring of 
such duplicate matches is retained in the test data set. The 
web server also presents the results this way: only the 
highest scoring of such duplicate matches is presented. 
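
The duplicate-removal step might look like the sketch below. It assumes that 'duplicates' are several placements of the same catalytic site template on the same protein (for instance on different chains); the text does not spell out the grouping key, so the tuple layout, the function name and the sample values are hypothetical.

```python
def drop_duplicate_matches(matches):
    """Keep only the highest-scoring match per (protein, catalytic site) pair.

    matches: iterable of (protein_id, site_id, location, score), where
             `location` distinguishes placements of the same site on the
             same protein (e.g. a chain identifier).
    Returns the retained matches ordered from highest to lowest score.
    """
    best = {}
    for match in matches:
        protein_id, site_id, _location, score = match
        key = (protein_id, site_id)
        if key not in best or score > best[key][3]:
            best[key] = match
    return sorted(best.values(), key=lambda m: m[3], reverse=True)

# Hypothetical input: site S1-0 is found on two chains of protein P1.
example = [("P1", "S1-0", "A", 0.40),
           ("P1", "S1-0", "B", 0.65),
           ("P1", "S2-0", "A", 0.10),
           ("P2", "S1-0", "A", 0.30)]
print(drop_duplicate_matches(example))
```

Keeping one representative per pair prevents a single protein with many equivalent chains from dominating either axis of the ROC curve.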

The performance of the classifier on the training data 
set and test data set was analyzed through ROC curves 
and Matthews correlation coefficients (MCC). The 'area 
under the curve' (AUC) for ROC curves can serve as an 
indicator of the classifier's discrimination. An ideal 
classifier would correctly identify all of the positives (true 
positive rate equals 1.0) without incorrectly identifying 
any negatives as positive (false positive rate equals 0). 
Such an ideal ROC would have AUC equal to 1.0. The 
MCC provides an indication of the performance of the 
classifier as a function of the threshold score chosen as 
the value to distinguish between putative positive and 
negative results. 
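
For readers who want to reproduce this kind of evaluation on their own labelled matches, the quantities above can be computed with a few lines of code. This is a generic sketch using the standard definitions of the true positive rate, false positive rate, MCC and trapezoidal AUC; it is not the evaluation code used for Figures 4 and 5, and it assumes that the data contain at least one positive and one negative.

```python
import math

def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def roc_auc(scores, labels):
    """Area under the ROC curve by trapezoidal integration."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Evaluating `mcc(*confusion_counts(scores, labels, t))` over a range of thresholds `t` gives an MCC-versus-threshold curve of the kind described above.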

The area under the ROC curve for the training data set 
is 0.94, as shown in Figure 4a. The AUC for the test data 
set is 0.89. Although the MCC for the training set shows a 
peak at a probability score threshold of ~0.55, the curve is 
broadly flat down to a threshold value close to zero, at a 
true positive rate of ~0.85 (see Figure 4b); the correspond- 
ing false positive rate at that threshold is close to zero (see 
Figure 4a). The data indicate that a true positive rate of 
0.85 is achieved with a false positive rate of 0.004 at a 
threshold value of 0.02. As a practical matter, hits with 
a score of ~0.02 and above appear to be of interest, as the 
case study below illustrates. The test set curves (Figure 5) 
indicate similar conclusions, though the true positive rate 
is somewhat lower, ~0.79, and the false positive rate 
slightly higher, 0.010, at the 0.02 threshold. 
Figure 4. Performance of the logistic regression classifier on the training data. (a) ROC curve. AUC is 0.94. (b) MCC in blue and true positive rate 
(TPR) in red versus score threshold. The classifier shows good performance, as 85% of the matches are correctly identified with a false positive rate 
of only 0.4%. 

Figure 5. Performance of the logistic regression classifier on the test data set. (a) ROC curve. AUC is 0.89. (b) MCC in blue and true positive rate 
(TPR) in red versus score threshold. The classifier's performance on the test data, which are distinct from the data used to train the classifier, is close 
to its performance on the training data: 79% of the matches are correctly identified with a false positive rate of only 1.0%. 



Sample uses 

Browsing catalytic site matches to a PDB protein 

The web server allows users to search its database for 
pre-calculated matches between library catalytic sites 
and PDB proteins. Figure 2 shows the results of 
searching for matches to PDB 1deo. The top-scoring 
results within a threshold score of 0.02, as suggested 
earlier in the text in 'Benchmarks and testing', are 
shown. Although 1deo does not have an EC annotation 
in its PDB file, the crystallographers have identified 
1deo as a rhamnogalacturonan acetylesterase (52), 
which according to the IntEnz website (53) corresponds 
to EC 3.1.1.86. The top two matches from the catalytic 
site identification web server are both from PDB 1pp4, 
one match to each of the catalytic sites in the two 
chains, and correctly identify 1deo as EC 3.1.1.86. 
Inspection of the catalytic site in Figure 3a confirms 
that the residues are well aligned. The third match is 
to EC 3.1.1.47, which differs only in the fourth figure, 
indicating that the substrate of the reaction is different. The 
3.1.1 family of enzymes are carboxylic ester hydrolases 
that cleave the acetyl group from an acetylester. This 
chemistry is conserved in both cases, with the only 
difference being the particular molecule being cleaved. The 
final two matches are to EC 3.1.2.0000, which is a 
thioesterase (Thioesterase I). Thioesters are closely 
related to carboxylic esters. The catalytic functions are 
also similar, with the difference being that the cleavage is at a 
sulphur site adjacent to a carbonyl group (a thioester) 
rather than an oxygen site adjacent to an acetyl group. 
Figure 3b displays the resulting binding site match. The 
crystallographers' modification of SER10 of the catalytic 
site, PDB 1j00 (54), is not shown in Figure 3b to make the 
comparison with Figure 3a easier. 
Table 2. Top 10 results of browsing protein matches to catalytic sites in EC sub-subclass 4.2.1 

[...]   Carbonate anhydrase 
0.957   1qrf   1qrg-0   [...]          Carbonic anhydrase      4.02.01.0001   Carbonate anhydrase 
0.948   1qre   1qrg-0   4.02.01.0001   Carbonic anhydrase      4.02.01.0001   Carbonate anhydrase 
0.927   1qrl   [...] 
0.548   1f93   1dco-4   4.02.01.0096   Dcoh                    4.02.01.0096   Pterin-4-alpha-carbinolamine dehydratase 
[...]          1dub-0   4.02.01.0017   2-enoyl-coa hydratase 







Browsing matches to proteins currently without 
annotation in PDB 

The web server also allows users to search by EC number 
for proteins that do not have an EC annotation in their 
PDB file. Table 2 shows the top-scoring results of a search 
for matches between PDB proteins and catalytic sites in 
the EC 4.2.1 sub-subclass ('Hydro-Lyases'). Although the 
results include only those proteins that do not have EC 
annotations, many proteins do have EC annotations in 
their UniProt record, as shown in Table 2. 

Not surprisingly, the top-scoring matches are matches 
between proteins and their own catalytic sites or catalytic 
sites from closely related proteins. In these cases, the 
UniProt annotations agree with the web server's identifi- 
cation in all four EC figures. The UniProt confirmation of 
the four EC figures illustrates that the web server makes 
reliable functional identifications. Thus, the remaining 
imputed functional identifications should be worth 
further consideration. 

The web server identifies PDB 2wtb as a '2-enoyl-coa 
hydratase'. Although 2wtb does not have an EC annota- 
tion in either PDB or UniProt, the crystallographers have 
identified 2wtb as having 2-trans-enoyl-CoA hydratase 
activity (55). Another match is with PDB 2qq3, which 
also does not have EC annotations and is also identified 
by its crystallographers as having 2-trans-enoyl-CoA 
hydratase activity (56). Inspection of the binding sites 
reveals convincing similarity between the binding sites. 
This example, using EC 4.2.1, reveals that the search 
procedure can sometimes provide an additional way of 
identifying similarities that can serve to complement 
incomplete annotations in PDB and UniProt. 

Using a novel catalytic site to find matches 
throughout PDB 

To illustrate the capability to search throughout PDB 
using a user-defined catalytic site, we use the plasminogen 
activator of Yersinia pestis, taken from recent studies (57,58). 
This protein is a member of the omptin family of 
proteases, and the EC number reported by the study for this site 
is EC 3.4.23.48, which is not currently in the server's 
library of catalytic sites. The highest resolution structure 
for this protein is PDB 2x55 (1.85 Å), and the most recent 
crystal structure is PDB 4dcb. Catalytic sites were 
extracted from PDB 2x55 and PDB 4dcb, using residues 
(Asp|Asn)84, (Asp|Asn)86, (Asp|Asn)206 and His208, as 
identified in (58). Both sites were used to explore the 
effects of spatial variations. 
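
The residue notation used above, in which alternatives are separated by a vertical bar, maps naturally onto a set of allowed residue types plus a residue number. The parser below is an illustrative sketch of that mapping; it is not the server's input format, and the function name is hypothetical.

```python
import re

# Optional parentheses, one or more residue types separated by '|',
# followed by the residue number, e.g. "(Asp|Asn)84" or "His208".
_RESIDUE_SPEC = re.compile(r"^\(?([A-Za-z|]+)\)?(\d+)$")

def parse_residue_spec(spec):
    """Return (allowed_residue_types, residue_number) for one critical residue."""
    match = _RESIDUE_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognized residue specification: {spec!r}")
    allowed = frozenset(name.upper() for name in match.group(1).split("|") if name)
    return allowed, int(match.group(2))

site = [parse_residue_spec(s)
        for s in ["(Asp|Asn)84", "(Asp|Asn)86", "(Asp|Asn)206", "His208"]]
print(site)
```

Representing each position as a set of permitted residue types makes the substitution check during matching a simple membership test.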

The top protein matches to the input sites include the 
proteins from which the catalytic sites were drawn, as well 
as the related structure, 2x56 (see Table 3). Although the 
next matching protein, 1i78, is annotated with EC 
3.4.21.87 in PDB, this EC number has been reassigned 
in UniProt as EC 3.4.23.49 (omptin) (59), a more 
general endopeptidase that is not specific to the 
plasminogen Arg560-Val561 peptide bond. This similarity is well 
within what would be considered a typical correct 
match. 

The next two matches, PDB 3vc5 and 3vc6, are not yet 
annotated in PDB. The PDB record currently reports isom- 
erase activity. These enzymes appear to be part of the 
enolase (ES) superfamily (11,60), but one notable differ- 
ence is the absence of a divalent metal ion, which is 
required for the initial proton abstraction step. The 
absence of this ion points to the possibility of another cata- 
lytic function for this enzyme. Viewing the aligned catalytic 
site (Figure 6a) shows reasonably close superposition of the 
catalytic residues to PDB 2x55. The amide cleavage ma- 
chinery appears to be present in the 3vc5 site. However, 
these residues are more buried in the 3vc5 site, which 
suggests that endopeptidase activity (of which plasminogen 
activation is a specific case) is not the likely function. A 
different amide bond cleavage mechanism may be possible. 
ESs are known to host small peptide substrates, as is the 
case with the dipeptide isomerases (61,62). The residues 
appear to be reasonably placed in the known catalytic 
region of ESs for a mechanism of this type. 
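
One common way to put a number on how closely two sets of catalytic residues superpose is the root-mean-square deviation (RMSD) over paired atoms. The sketch below is only an illustration of that general measure: it assumes the two coordinate sets have already been superposed and are listed in the same atom order, and it is not presented as the scoring function used by the server, which is not described in this section.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(site_a: Sequence[Point], site_b: Sequence[Point]) -> float:
    """RMSD between two equally sized, already superposed atom sets.

    Atoms are paired by position in the sequences, so both inputs must
    list corresponding atoms in the same order.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal size")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(site_a, site_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(site_a))
```

A small RMSD between, for example, the 2x55 site atoms and the aligned 3vc5 atoms would correspond to the "reasonably close superposition" described above; how burial of the residues affects function is a separate question that this measure does not address.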

The next match, PDB 1bqg, has EC 4.2.1.40, 'glucarate 
dehydratase', which is also a member of the ES superfamily. 
For 1bqg, the residues matching those of the plasminogen 
activator site are in similarly good alignment (see 
Figure 6b) and are also in the known ES catalysis site. 
However, the function of 1bqg is well known to be 
mandelate racemase (MR) (63). The web server's results 
are consistent with this: entering 1bqg as a search protein 
Table 3. Top 11 matches across PDB to plasminogen activator binding sites 

Score        Catalytic site  Protein  Protein PDB EC (if known)  Protein PDB EC label
[illegible]  2x55            2x55     3.4.23.48                  Plasminogen activator
0.998        4dcb            4dcb     3.4.23.48                  Plasminogen activator
0.962        2x55            2x56     3.4.23.48                  Plasminogen activator
0.895        2x55            4dcb     3.4.23.48                  Plasminogen activator
0.893        4dcb            2x55     3.4.23.48                  Plasminogen activator
0.696        4dcb            2x56     3.4.23.48                  Plasminogen activator
0.552        2x55            1i78     3.4.21.87                  Omptin [3.4.23.49]
0.368        4dcb            1i78     3.4.21.87                  Omptin [3.4.23.49]
0.063        2x55            3vc5
0.056        2x55            3vc6
0.055        4dcb            1bqg     4.2.1.40                   Glucarate dehydratase



Figure 6. Visualization of match between plasminogen activator site and possible proteolytic sites in isomerases. (a) The aligned matching critical 
residues in catalytic site 2x55, 3.4.23.48, Plasminogen activator (magenta) and protein 3vc5 (blue). (b) The aligned matching critical residues in 
catalytic site 4dcb, 3.4.23.48, Plasminogen activator (magenta) and protein 1bqg (blue). 



correctly returns matches to the 1ec7 ES and MR binding 
sites. 

Since the MR functionality is well established for 1bqg, 
we propose that the protease site is a possible second 
function for 1bqg, as well as other ESs that appear in 
the search. Experimental verification and further study 
would be needed to confirm these potential function 
predictions for both these enzymes. 



DISCUSSION 

The catalytic site identification web server offers users 
several options for quickly exploring potential catalytic 
functions of novel or uncharacterized proteins, or for 
finding proteins that currently have not been identified 
as having a particular catalytic activity. In addition, 
because the server can generalize a catalytic site as any 
binding site that the user chooses to enter, the server 
should also have uses beyond catalytic function identifica- 
tion, such as in the discovery of off-target drug inter- 
actions or allosteric binding sites. 

In the near term, a number of enhancements to the 
web server will be explored, such as additional browsing 
capabilities and user-customization of search parameters. 
In the future, we would like to increase the 
server's coverage of the 'enzymatic universe' by adding 
to the server's library of catalytic site templates. An 
additional goal is to extend the server's capabilities so 
that it becomes an element of a protein function prediction 
'pipeline'. To allow users to start with only a 
protein sequence, the pipeline would generate homology 
models of protein structure(s) based on a sequence 
and secondary structure template library [see (42)]. 
The resulting homology models would be directed to the 
catalytic site identification server to identify potential 
catalytic sites and their function. These candidate catalytic 
sites could be verified through docking calculations, 
where the metabolites, suggested by the proposed EC 
number identifications, would each be tested for 
binding to the candidate sites [see (64)]. Used along 
with sequence-based methods, such an approach would 
provide independent lines of evidence that would go 
some way toward addressing the difficult task of 
protein function prediction. 
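
The proposed pipeline can be pictured as three stages chained together. The sketch below is purely illustrative: the function `predict_function`, its parameters and the three stage callables (`build_models`, `match_sites`, `dock`) are hypothetical placeholders for the homology modeling, catalytic site matching and docking steps, and do not correspond to any existing interface of the server.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def predict_function(
    sequence: str,
    build_models: Callable[[str], Iterable[Any]],
    match_sites: Callable[[Any], Iterable[Tuple[str, float]]],
    dock: Callable[[Any, str], float],
    min_score: float = 0.0,
) -> List[Dict[str, Any]]:
    """Chain the three hypothetical stages for one protein sequence.

    build_models: sequence -> homology models
    match_sites:  model -> (EC number, site match score) pairs
    dock:         (model, EC number) -> docking result for the metabolites
                  suggested by that EC number
    """
    results: List[Dict[str, Any]] = []
    for model in build_models(sequence):
        for ec_number, score in match_sites(model):
            if score < min_score:
                continue  # skip weak site matches before the costly docking step
            results.append({
                "model": model,
                "ec": ec_number,
                "site_score": score,
                "docking": dock(model, ec_number),
            })
    # Strongest catalytic site matches first.
    results.sort(key=lambda r: r["site_score"], reverse=True)
    return results
```

Passing the stages in as callables keeps the sketch independent of any particular modeling or docking tool; the `min_score` cut-off is likewise an assumption, included only to show where weak site matches could be filtered out before docking.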



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Methods and Supplementary References 
[47,65-67]. 